Theory of Deep Learning III: Generalization Properties of SGD

Authors

  • Chiyuan Zhang
  • Qianli Liao
  • Alexander Rakhlin
  • Brando Miranda
  • Noah Golowich
  • Tomaso Poggio
Abstract

In Theory III we characterize, with a mix of theory and experiments, the consistency and generalization properties of deep convolutional networks trained with Stochastic Gradient Descent in classification tasks. A currently perceived puzzle is that deep networks show good predictive performance when overparametrization relative to the number of training data suggests overfitting. We describe an explanation of these empirical results in terms of the following new results on SGD:

  1. SGD concentrates in probability, like the classical Langevin equation, on large-volume, "flat" minima, selecting flat minimizers which are, with very high probability, also global minimizers.
  2. Minimization by GD or SGD on flat minima can be approximated well by minimization of a linear function of the weights, suggesting pseudoinverse solutions.
  3. Pseudoinverse solutions are known to be intrinsically regularized with a regularization parameter λ which decreases as 1/T, where T is the number of iterations. This can qualitatively explain all the generalization properties empirically observed for deep networks.
  4. GD and SGD are closely connected to robust optimization. This provides an alternative way to show that GD and SGD perform implicit regularization.

These results explain the puzzling findings about fitting randomly labeled data while performing well on naturally labeled data. They also explain why overparametrization does not result in overfitting. Quantitative, non-vacuous bounds are still missing, as has almost always been the case for most practical applications of machine learning. In the appendix we describe an alternative approach that explains more directly, with tools of linear algebra, the same qualitative properties and puzzles of generalization in deep polynomial networks. This is version 2; the first version was released on 04/04/2017 at https://dspace.mit.edu/handle/1721.1/107841.
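
The linear-algebra picture behind points 2 and 3 is easy to reproduce numerically. Below is a minimal sketch (ours, not the authors' code) using an overparametrized least-squares problem: gradient descent started from zero converges to the pseudoinverse (minimum-norm) solution, and stopping after T iterations behaves roughly like ridge regression with regularization strength on the order of 1/(lr·T). The problem sizes, learning rate, and iteration counts are illustrative assumptions.

    import numpy as np

    # Overparametrized least squares: fewer samples (n) than parameters (d),
    # so infinitely many weight vectors fit the data exactly.
    rng = np.random.default_rng(0)
    n, d = 20, 100
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)

    # Pseudoinverse (minimum-norm) solution of min_w ||Xw - y||^2.
    w_pinv = np.linalg.pinv(X) @ y

    lr = 1e-3  # step size; must satisfy lr < 2 / lambda_max(X^T X)
    for T in (10, 100, 1000, 10000):
        # Plain gradient descent on 0.5 * ||Xw - y||^2, started at zero.
        w = np.zeros(d)
        for _ in range(T):
            w -= lr * X.T @ (X @ w - y)
        # Ridge solution with lambda on the order of 1/(lr * T).
        lam = 1.0 / (lr * T)
        w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        print(f"T={T:6d}  ||w_gd - w_ridge||={np.linalg.norm(w - w_ridge):.3f}"
              f"  ||w_gd - w_pinv||={np.linalg.norm(w - w_pinv):.3f}")

As T grows, the gradient-descent iterate approaches w_pinv, and for intermediate T it stays reasonably close to the ridge solution with λ ≈ 1/(lr·T), which is the implicit-regularization effect the abstract refers to.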

Related articles

Musings on Deep Learning: Properties of SGD

We ruminate with a mix of theory and experiments on the optimization and generalization properties of deep convolutional networks trained with Stochastic Gradient Descent in classification tasks. A currently perceived puzzle is that deep networks show good predictive performance when overparametrization relative to the number of training data suggests overfitting. We dream an explanation of these...

Generalization Error Bounds with Probabilistic Guarantee for SGD in Nonconvex Optimization

The success of deep learning has led to a rising interest in the generalization property of the stochastic gradient descent (SGD) method, and stability is one popular approach to study it. Existing works based on stability have studied nonconvex loss functions, but only considered the generalization error of the SGD in expectation. In this paper, we establish various generalization error bounds...

Averaging Weights Leads to Wider Optima and Better Generalization

Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) p...
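
As a concrete illustration of the averaging idea (a minimal sketch under our own assumptions, not the SWA authors' implementation), the following keeps a running mean of the weights visited by SGD after a burn-in period and returns both the last iterate and the averaged one:

    import numpy as np

    def sgd_with_weight_averaging(grad_fn, w0, lr=0.05, n_steps=2000, avg_start=1000):
        """Plain SGD plus a running average of the iterates from avg_start onward."""
        w = w0.astype(float).copy()
        w_avg = np.zeros_like(w)
        n_avg = 0
        for step in range(n_steps):
            w -= lr * grad_fn(w)              # one (noisy) gradient step
            if step >= avg_start:
                n_avg += 1
                w_avg += (w - w_avg) / n_avg  # incremental mean of visited weights
        return w, w_avg

    # Toy usage: noisy gradients of a quadratic centred at the origin.
    rng = np.random.default_rng(0)
    noisy_grad = lambda w: 2.0 * w + rng.normal(scale=1.0, size=w.shape)
    w_last, w_averaged = sgd_with_weight_averaging(noisy_grad, np.full(10, 5.0))
    print("last iterate distance to optimum:", np.linalg.norm(w_last))
    print("averaged iterate distance:       ", np.linalg.norm(w_averaged))

The averaged iterate typically lands closer to the centre of the region SGD bounces around in, which is the intuition the paper connects to wider, flatter optima.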

The Regularization Effects of Anisotropic Noise in Stochastic Gradient Descent

Understanding the generalization of deep learning has raised lots of concerns recently, where the learning algorithms play an important role in generalization performance, such as stochastic gradient descent (SGD). Along this line, we particularly study the anisotropic noise introduced by SGD, and investigate its importance for the generalization in deep neural networks. Through a thorough empi...
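
The central object here, the minibatch gradient noise, can be inspected directly. Below is a small sketch (our own toy setup, not the paper's experiments) that estimates its covariance on a linear regression problem and prints the eigenvalues, which come out far from equal, i.e. the noise is anisotropic:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, batch = 500, 10, 32
    # Features with very different scales make the noise clearly anisotropic.
    X = rng.standard_normal((n, d)) * np.linspace(0.2, 3.0, d)
    y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
    w = np.zeros(d)                      # evaluate the noise at the initial point

    full_grad = X.T @ (X @ w - y) / n    # full-batch gradient of the squared loss
    noise = []
    for _ in range(2000):
        idx = rng.choice(n, size=batch, replace=False)
        g = X[idx].T @ (X[idx] @ w - y[idx]) / batch
        noise.append(g - full_grad)      # minibatch gradient minus its mean
    cov = np.cov(np.array(noise).T)
    print("noise covariance eigenvalues:", np.round(np.linalg.eigvalsh(cov), 4))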

Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data

One of the defining properties of deep learning is that models are chosen to have many more parameters than available training data. In light of this capacity for overfitting, it is remarkable that simple algorithms like SGD reliably return solutions with low test error. One roadblock to explaining these phenomena in terms of implicit regularization, structural properties of the solution, and/o...

Publication date: 2017